Remote config and join flow (LLP 0025)#98
Merged
Conversation
Specifies the client half of centrally-managed gateway configuration: join sequence, config pull loop, seed-config mode, staged-restart apply semantics, hash-pinned install-on-config, and last-known-good rollback with post-apply probation. Supporting amendments: - LLP 0003: config apply engine added to the core-owns list - LLP 0011: join added as a non-interactive entry point - LLP 0017: staged restart for config replacement + installer relaunch requirement - central proto.md: policy tokens, running-config If-None-Match convergence semantics, hash-pinned plugins, 404 demoted to legacy - LLP 0000: subsystem map updated - notes-archive: round-1 review record for LLP 0022 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Origin master took 0022 for iceberg-export-partitioning (#91) while the join-flow spec was drafted on a local branch; renumber to the next free slot and update all cross-references. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Kernel side: - Config apply engine (src/core/config/apply.js): validate -> install pinned plugins -> persist to A/B slots -> atomic pointer flip -> staged restart, with last-known-good rollback, bad-etag re-apply backoff, and structured rollback reasons. The served etag lives in a per-slot sidecar written before the flip, so the document and its etag commit on the same rename in both directions. - Narrow ctx.configControl facade (stage / confirmPoll / runningEtag) exposed only in daemon mode; CLI boots leave it undefined so transport plugins keep their pull loops off. - Kernel-owned probation watchdog: window max(3 x poll_interval, 120s) sized from the staged document, cleared only by a confirmed poll, expiry also evaluated at boot before plugin activation (crashloop case), orphaned markers from a crash before the flip are discarded. - Staged restart: DAEMON_RESTART_EXIT_CODE (75) for foreground invokers; service managers already relaunch via KeepAlive / Restart=always (now pinned by tests). Sinks are closed on shutdown so plugin timers stop. - Hash-pinned install-on-config through the LLP 0007 path; pin verified against the staged artifact before the install commits. Bundled first-party plugins: version strict, hash skipped. - Config shape: plugin entries accept version / artifact_hash / source pins. - hypaware status: remote-config section (probation, last rollback + reason, remembered bad etag, running etag) in text and JSON. - hypaware join <url> [token] [--token-file|stdin] [--no-daemon]: writes the mode-0600 seed config and runs the daemon install — a wrapper over the two existing steps, not a second code path. Central plugin (transport only): - Config pull loop (central/src/config_client.js): immediate pull on bootstrap success, steady timer, If-None-Match always the running config's etag, 401 refresh-retry, 404 legacy backoff, 429/503 Retry-After + linear backoff, 1 MiB body cap. - Dropped the never-wired config_etag_path option (the etag is kernel-managed and read through the facade). Settled LLP 0023 open questions: poll default 300s, probation floor 120s, max document size 1 MiB. proto.md sidecar wording updated; the restart exit code recorded in LLP 0017. Tests: apply-engine state machine (18), pull loop (10), join command (5), installer relaunch defaults; join_flow_remote_config hermetic smoke drives seed -> bootstrap -> pull -> apply -> restart -> probation clear against a stub server with convergence assertions. Note: central_forward_outbox was already failing on origin/master (empty ingest rows) before this change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
Author
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Codex | Install-on-config validation-before-install (major, apply.js:311/322, apply_deps.js:51-74) | Risks #1; Direct callers installPinnedPlugins |
| Codex | 1 MiB cap enforced after full buffering (major, config_client.js:146-147) | Risks #3; Concurrency surface (response buffering) |
| Codex | Poll fetch lacks abort/timeout, shutdown can hang (major, config_client.js:232-259, runtime.js:421) | Risks #2; Concurrency surface (pull loop) |
| Claude | typecheck broken: bad type-import path (major, test/core/config-apply.test.js:69) | Risks #5 |
| Claude | typecheck: literal-type exit-code comparisons (minor, test/core/daemon.test.js:199) | Risks #5; Direct callers DAEMON_RESTART_EXIT_CODE |
| Claude | Pin-enforcement logic untested (major, apply_deps.js:74-141) | Risks #4; Config field chain artifact_hash |
| Claude | Inline import() types ban violated (minor, apply.js:265 et al.) |
Targets (new config files) |
| Claude | Size-cap comment/spec claims pre-buffering drop (minor, config_client.js:18) | Risks #3 |
| Claude | 429/503 + parseRetryAfter untested (minor, config_client.js:209-216) |
Config field chain config_poll_interval_seconds |
| Claude | closeAllSinks comment contradicts lazy-refresh comments (nit, runtime.js:611-613) |
Direct callers closeAllSinks |
Codex review
Fix Validations
No explicit bug-fix claims were in scope; reviewed as a feature implementation.
Findings
1) Behavioral Correctness
- Severity: major
- Confidence: high
- Evidence: src/core/config/apply.js:311, src/core/config/apply.js:322, src/core/config/apply_deps.js:51, src/core/config/apply_deps.js:56, src/core/config/apply_deps.js:74
- Why it matters: Install-on-config cannot reliably work for a remotely served plugin that is not already installed, because catalog-backed validation runs before the pinned install path.
- Suggested fix: Shape-parse first, install pinned non-bundled plugins from the parsed plugin entries, rebuild the catalog, then run full
validateConfig; add a test with a non-bundled, initially absent pinned plugin.
6) Security Surface
- Severity: major
- Confidence: high
- Evidence: hypaware-core/plugins-workspace/central/src/config_client.js:15, hypaware-core/plugins-workspace/central/src/config_client.js:146, hypaware-core/plugins-workspace/central/src/config_client.js:147
- Why it matters: The advertised 1 MiB transport cap is enforced only after
response.text()buffers the full body, so a bad or compromised central endpoint can still force unbounded memory use. - Suggested fix: Read
response.bodyas a stream with a byte counter and cancel once the cap is exceeded; optionally pre-reject oversizedContent-Length.
7) Resource Lifecycle & Cleanup
- Severity: major
- Confidence: medium
- Evidence: hypaware-core/plugins-workspace/central/src/config_client.js:232, hypaware-core/plugins-workspace/central/src/config_client.js:234, hypaware-core/plugins-workspace/central/src/config_client.js:252, hypaware-core/plugins-workspace/central/src/config_client.js:259, src/core/daemon/runtime.js:421
- Why it matters:
stop()waits for an in-flight poll, but the fetch has no abort signal or timeout; daemon shutdown or staged restart can hang indefinitely behind a stalled config GET. - Suggested fix: Create an
AbortControllerper poll, passsignalto fetch, abort it instop(), and add a bounded request timeout plus a test with a never-resolving fetch.
No Finding
- Contract & Interface Fidelity
- Change Impact / Blast Radius
- Concurrency, Ordering & State Safety
- Error Handling & Resilience
- Release Safety
- Test Evidence Quality
- Architectural Consistency
- Debuggability & Operability
Evidence Bundle
- Changed hot paths: central config pull loop; kernel config apply engine; pinned plugin install deps; daemon boot/probation/restart; join CLI; remote-config status.
- Impacted callers: hypaware-core/plugins-workspace/central/index.js:60; src/core/daemon/runtime.js:213; src/core/daemon/runtime.js:260; src/core/daemon/runtime.js:421; src/core/cli/core_commands.js:2811.
- Impacted tests: test/core/config-apply.test.js:91; test/plugins/central-config-pull.test.js:85; test/core/join-command.test.js:41; test/core/daemon.test.js:176; hypaware-core/smoke/flows/join_flow_remote_config.js:47.
- Unresolved uncertainty: I did not inspect
validateConfiginternals because of the 5-file review cap; finding 1 is based on the changed pipeline and its explicitknownPluginscatalog dependency. Tests were not run.
Claude review
Claude review
PR breaks npm run typecheck: broken type-import path in new test
- Severity: major
- Confidence: 95
- Evidence: test/core/config-apply.test.js:69
- Why it matters: The inline
import('../../../collectivus-plugin-kernel-types.d.ts')resolves one directory too high (file is attest/core/, repo root is../../), producing TS2307 and turning a previously cleannpm run typecheckred (verified: 4 errors at PR head, 0 at base). - Suggested fix: Add
PluginConfigInstanceto a top-of-file@importfrom'../../collectivus-plugin-kernel-types.d.ts'and reference the bare name in the@param.
PR breaks npm run typecheck: literal-type exit-code comparisons flagged by tsc
- Severity: minor
- Confidence: 90
- Evidence: test/core/daemon.test.js:199
- Why it matters:
DAEMON_RESTART_EXIT_CODEis inferred as the literal type75, so!== 0 && !== 1 && !== 2raises three TS2367 "comparison appears to be unintentional" errors, contributing to the now-failingnpm run typecheckgate. - Suggested fix: Widen the value before comparing (e.g.
const code = Number(DAEMON_RESTART_EXIT_CODE)or annotate/** @type {number} */) so the distinctness assertions typecheck.
Pin-enforcement logic in apply_deps.js has no traditional tests
- Severity: major
- Confidence: 85
- Evidence: src/core/config/apply_deps.js:74-141
- Why it matters: The apply engine tests mock
installPinnedPluginsentirely, so the real deterministic pin-validation decisions — thebundled_version_mismatchstrict-equality check, the satisfied-lock skip predicate, the confirm-callback hash rejection, and theremote_install_rejected→artifact_hash_mismatcherrorKind mapping — are asserted nowhere, and the hash pin is the PR's core "nothing may substitute code after the config was authored" security property; CLAUDE.md requires traditional tests for config-validation logic like this. - Suggested fix: Add
test/core/config-apply-deps.test.jscovering: pinned bundled plugin with mismatched version →bundled_version_mismatch; matching bundled version accepted without install; already-installed lock entry matching version+hash skipped; hash-pin mismatch →artifact_hash_mismatch. Alternatively make the install/discover dependencies injectable like the engine'sConfigApplyDeps.
Inline import() types in new code despite explicit ban
- Severity: minor
- Confidence: 90
- Evidence: src/core/config/apply.js:265, test/core/config-apply.test.js:55, test/core/config-apply.test.js:69, hypaware-core/smoke/flows/join_flow_remote_config.js:330
- Why it matters: CLAUDE.md Code Style says "Never use inline
import('...')types. Declare type imports at the top of the file with@importJSDoc comments" — these are all new files that even have@importblocks at the top, so the rule was known and inconsistently applied. - Suggested fix: Add
ConfigApplyErrorKindto apply.js's existing top-of-file@importblock; addPinnedInstallResult/PluginConfigInstanceto the test's@importblock; add@import { AddressInfo } from 'node:net'to the smoke flow.
Transport size-cap comment and spec claim a pre-buffering drop the code doesn't implement
- Severity: minor
- Confidence: 80
- Evidence: hypaware-core/plugins-workspace/central/src/config_client.js:18 (vs config_client.js:146-147 and llp/0023 "Config pull loop")
- Why it matters: The comment says "an oversized body should be dropped before it is buffered whole," but the implementation calls
await response.text()— buffering the entire response — before checking size, so the stated memory bound does not exist at the transport layer and the spec/code disagreement violates the "keep refs honest" discipline. - Suggested fix: Either stream the body and abort once 1 MiB is exceeded (or check
Content-Lengthbefore reading), or soften the comment and the LLP 0023 transport parenthetical to say the cap is checked after download but before parse/stage.
429/503 throttle handling and parseRetryAfter have no tests
- Severity: minor
- Confidence: 80
- Evidence: hypaware-core/plugins-workspace/central/src/config_client.js:209-216, 268-275
- Why it matters:
test/plugins/central-config-pull.test.jscovers 200/304/401/404/oversize/bad-JSON/transport-error branches but not the 429/503 branch orparseRetryAfter, pure deterministic parsing with three distinct paths where a wrong negative/NaN result would silently produce a zero-delay poll loop against a throttling server. - Suggested fix: Add cases: 429 with
Retry-After: 7schedules without confirming the poll and logsconfig_poll_throttledwithretry_after_seconds: 7; 503 with an HTTP-date and with a garbage header falls back to the backoff ladder.
closeAllSinks comment contradicts the PR's own "identity refresh is lazy" comments
- Severity: nit
- Confidence: 85
- Evidence: src/core/daemon/runtime.js:611-613
- Why it matters: The new comment says the central plugin's "identity refresh timers stop in its
close()", but two other comments added in the same PR (sink.js:103-105, config_client.js:38-40) state identity refresh is lazy and has no timer, sending a future maintainer hunting for a timer that does not exist. - Suggested fix: Reword the
closeAllSinksJSDoc to mention only the config pull loop, e.g. "The central plugin's config pull loop stops in itsclose()(identity refresh is lazy and has no timer)".
Reports: /Users/phil/workspace/hypaware/.git/worktrees/remote-config-join-flow/dual-review/pr-98
Apply engine (LLP 0023): - Reorder stage() to shape-check -> install pinned plugins -> full validation, so a served config can name a not-yet-installed plugin (catalog-backed validation only knows a plugin once it is installed). LLP 0023 install-on-config section updated with the ordering rationale. Config pull transport: - Enforce the 1 MiB cap before buffering: oversized Content-Length is rejected without reading, chunked bodies stream through a byte counter that cancels at the cap. - Per-poll AbortController with a 30s request deadline covering request and body read; stop() aborts an in-flight poll after a 1s drain grace so a stalled config GET cannot wedge daemon shutdown. Tests: - New test/core/config-apply-deps.test.js covering the real pin enforcement: bundled version mismatch, bundled hash exemption, satisfied-lock skip, artifact hash mismatch via a local git fixture, and install-then-validate over a fresh catalog. - Pull-loop tests for Content-Length pre-reject, chunked cap cancel, stop() abort, request deadline, 429/503 Retry-After, and parseRetryAfter (now exported). - Engine ordering test (install before validate) and shape-gate test. Typecheck and style: - Fix broken inline type-import path and literal exit-code comparisons that broke npm run typecheck; replace remaining inline import() types with top-of-file @import blocks. - Reword closeAllSinks JSDoc: identity refresh is lazy and has no timer. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
Author
|
Addressed the dual-review findings in 17665e7: Majors
Minors/nits
Verified: |
An unref'd timer can't fire once a wedged fetch is the only live handle: the event loop drains, so the deadline/grace abort never happens — in CI this surfaced as node:test cancelling the pull-loop tests with 'Promise resolution is still pending but the event loop has already resolved'. Both timers are cleared as soon as the poll settles, and the loop's policy is no-unref anyway. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Implements the client half of centrally-managed gateway configuration, per LLP 0023 (grilled + reviewed before implementation; review record in
notes-archive/llp-reviews/).What this adds
Spec + supporting docs (first two commits)
proto.md(policy tokens, running-configIf-None-Matchconvergence semantics, hash-pinned plugins, 404 demoted to legacy).Kernel
src/core/config/apply.js— the config apply engine: validate → install pinned plugins → persist to A/B slots → atomic pointer flip → staged restart; last-known-good rollback, single remembered bad etag with re-apply backoff, structured rollback reasons. Each slot carries its served etag in a sidecar written before the flip, so the document and its etag commit on the same rename in both directions (apply and rollback).ctx.configControlfacade (stage/confirmPoll/runningEtag), present only in daemon mode — CLI boots never fire config polls.max(3 × poll_interval_seconds, 120s)sized from the staged document; cleared only by a confirmed authenticated poll; expiry also evaluated at boot before plugin activation (crashloop case); orphaned markers from a crash before the flip are discarded.DAEMON_RESTART_EXIT_CODE(75) for foreground invokers; installers already renderKeepAlive/Restart=always, now pinned by tests. Sinks are closed on daemon shutdown. Restart requests that race daemon startup are parked and replayed rather than dropped.hypaware join <url> [token] [--token-file|stdin] [--no-daemon]— writes the mode-0600 seed config and runs the non-interactive daemon install; a wrapper over the two existing steps, not a second code path.hypaware statusgains a remote-config section: probation state, last rollback + structured reason, remembered bad etag, running etag (text + JSON).Central plugin (transport only)
central/src/config_client.js— the config pull loop: immediate pull on bootstrap success, steady timer,If-None-Matchalways reflecting the running config, 401 refresh-and-retry, legacy-404 backoff, 429/503Retry-After+ linear backoff, 1 MiB body cap.config_etag_pathoption (the running etag is kernel-managed, read through the facade).Testing
join(5), installer relaunch defaults + exit-code distinctness. Full suite: 865 pass / 0 fail.join_flow_remote_config: join → seed boot → identity bootstrap → pull (200) → kernel apply → staged restart (exit 75) → relaunch on the served config → probation cleared by a 304, against a stub central server — asserting wire-level convergence (If-None-Matchtransitions), token retirement from disk, seed preserved as the rollback slot, and theconfig.apply/config.applied/config.probation_clearedtelemetry.daemon_foreground_start_stop,core_boot_noop,gateway_codex_captureall pass.@refannotations validate against the LLP corpus.Notes for reviewers
central_forward_outboxfails identically on a clean origin/master checkout (empty ingest rows) — pre-existing, likely from feat(sinks): partition iceberg exports by day with conversation sort #91's day-partitioning change, not touched here.hypaware-serverLLP 0009) lands separately and ships dark; nothing here blocks on it except end-to-end testing.🤖 Generated with Claude Code